List of all variables in the dataframe
## [1] "row_num" "id" "rank" "workers" "company"
## [6] "url" "state_l" "state_s" "city" "metro"
## [11] "growth" "revenue" "industry" "yrs_on_list"
Dimensions of the dataframe
## [1] 5000 14
Structure of dataframe with preview of data values
## 'data.frame': 5000 obs. of 14 variables:
## $ row_num : int 0 1 2 3 4 5 6 7 8 9 ...
## $ id : int 22890 25747 25643 26098 26182 22913 22937 25413 26079 25861 ...
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ workers : int 227 191 145 62 92 50 129 130 264 11 ...
## $ company : Factor w/ 5000 levels "(add)ventures",..: 1725 3569 3651 4211 79 3520 1094 3357 4703 1826 ...
## $ url : Factor w/ 5000 levels "@properties",..: 1725 3569 3647 4211 76 3520 1094 3357 4703 1826 ...
## $ state_l : Factor w/ 51 levels "Alabama","Alaska",..: 5 5 48 5 22 20 5 3 38 34 ...
## $ state_s : Factor w/ 51 levels "AK","AL","AR",..: 5 5 47 5 20 22 5 4 38 28 ...
## $ city : Factor w/ 1352 levels "Acton","Ada",..: 355 355 37 930 737 46 1142 1103 979 1325 ...
## $ metro : Factor w/ 326 levels "","Adrian MI",..: 171 171 314 262 39 163 261 226 229 321 ...
## $ growth : num 158957 57348 55460 26043 20690 ...
## $ revenue : num 195640000 82640563 85076502 35293000 77652360 ...
## $ industry : Factor w/ 25 levels "Advertising & Marketing",..: 5 11 2 23 24 7 13 13 25 7 ...
## $ yrs_on_list: int 2 1 1 1 1 2 2 1 1 1 ...
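The three outputs above are the kind that `names()`, `dim()`, and `str()` print. The report's own loading code is not shown, so the following is only a minimal sketch on a small stand-in data frame; `companies_demo` is a hypothetical name, and its rows are copied from the first three companies in the printed output.

```r
# Hypothetical stand-in for the Inc. 5000 data: three rows, four columns.
companies_demo <- data.frame(
  rank    = 1:3,
  workers = c(227L, 191L, 145L),
  state_l = factor(c("California", "California", "Virginia")),
  growth  = c(158956.91, 57347.92, 55460.16)
)

names(companies_demo)  # column names
dim(companies_demo)    # number of rows, then number of columns
str(companies_demo)    # type and first values of every column
```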
Explore factor variables and the different levels in State and Industry
## [1] "Alabama" "Alaska" "Arizona"
## [4] "Arkansas" "California" "Colorado"
## [7] "Connecticut" "Delaware" "District of Columbia"
## [10] "Florida" "Georgia" "Hawaii"
## [13] "Idaho" "Illinois" "Indiana"
## [16] "Iowa" "Kansas" "Kentucky"
## [19] "Louisiana" "Maine" "Maryland"
## [22] "Massachusetts" "Michigan" "Minnesota"
## [25] "Mississippi" "Missouri" "Montana"
## [28] "Nebraska" "Nevada" "New Hampshire"
## [31] "New Jersey" "New Mexico" "New York"
## [34] "North Carolina" "North Dakota" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania"
## [40] "Puerto Rico" "Rhode Island" "South Carolina"
## [43] "South Dakota" "Tennessee" "Texas"
## [46] "Utah" "Vermont" "Virginia"
## [49] "Washington" "West Virginia" "Wisconsin"
## [1] "Advertising & Marketing" "Business Products & Services"
## [3] "Computer Hardware" "Construction"
## [5] "Consumer Products & Services" "Education"
## [7] "Energy" "Engineering"
## [9] "Environmental Services" "Financial Services"
## [11] "Food & Beverage" "Government Services"
## [13] "Health" "Human Resources"
## [15] "Insurance" "IT Services"
## [17] "Logistics & Transportation" "Manufacturing"
## [19] "Media" "Real Estate"
## [21] "Retail" "Security"
## [23] "Software" "Telecommunications"
## [25] "Travel & Hospitality"
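Level listings like the two above are what `levels()` returns for a factor. A minimal sketch, assuming a short hypothetical factor in place of the real 51-level `state_l` column:

```r
# Hypothetical stand-in column; the real state_l factor has 51 levels.
state_l <- factor(c("California", "California", "Virginia", "Maryland"))

levels(state_l)   # distinct levels, sorted alphabetically by default
nlevels(state_l)  # how many levels there are
table(state_l)    # number of rows per level
```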
Summary of the data set
## row_num id rank workers
## Min. : 0 Min. : 4 5000 : 1 Min. : 0
## 1st Qu.:1250 1st Qu.:19575 4999 : 1 1st Qu.: 24
## Median :2500 Median :23292 4998 : 1 Median : 50
## Mean :2500 Mean :20037 4997 : 1 Mean : 209
## 3rd Qu.:3749 3rd Qu.:25370 4996 : 1 3rd Qu.: 125
## Max. :4999 Max. :26620 4995 : 1 Max. :34219
## (Other):4994
## company url state_l
## (add)ventures : 1 @properties : 1 California: 694
## @Properties : 1 110-consulting: 1 Texas : 404
## 110 Consulting: 1 123stores : 1 New York : 335
## 123Stores : 1 180 : 1 Florida : 303
## 180 : 1 180fusion : 1 Virginia : 284
## 180Fusion : 1 1seocom : 1 Illinois : 238
## (Other) :4994 (Other) :4994 (Other) :2742
## state_s city metro growth
## CA : 694 New York : 178 New York City: 399 Min. : 42.45
## TX : 404 Chicago : 95 Washington DC: 316 1st Qu.: 84.21
## NY : 335 Atlanta : 94 Los Angeles : 274 Median : 151.72
## FL : 303 Austin : 87 Chicago : 224 Mean : 516.44
## VA : 284 San Diego: 80 Atlanta : 194 3rd Qu.: 347.65
## IL : 238 Houston : 76 Dallas : 169 Max. :158956.91
## (Other):2742 (Other) :4390 (Other) :3424
## revenue industry yrs_on_list
## Min. : 1953000 IT Services : 733 Min. : 1.000
## 1st Qu.: 4876791 Advertising & Marketing : 453 1st Qu.: 1.000
## Median : 10722077 Business Products & Services: 435 Median : 2.000
## Mean : 43058182 Health : 377 Mean : 2.744
## 3rd Qu.: 26952131 Software : 338 3rd Qu.: 4.000
## Max. :5528202691 Financial Services : 278 Max. :12.000
## (Other) :2386
Histogram of states where companies are located
Histograms of workers by count
The first plot's binwidth is too coarse to show the trend. Reducing the binwidth reveals a right-skewed histogram. What happens to the distribution if I apply a log10 transformation?
Taking the log10 of workers compresses the long tail and makes the distribution easier to read. The transformed workers distribution looks close to a normal distribution with a longer tail on the right.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 24 50 209 125 34220
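The summary shows a minimum of 0 workers, and `log10(0)` is `-Inf`, so zero counts have to be dropped (or otherwise handled) before the transformation. The sketch below illustrates this in base R; the report's plotting code is not shown, so `hist()` is a stand-in rather than the author's method, and the single zero in the sample is added only to represent that minimum.

```r
# First ten worker counts from the str() output, plus one zero to
# represent the minimum of 0 reported in the summary.
workers <- c(227, 191, 145, 62, 92, 50, 129, 130, 264, 11, 0)

# log10(0) is -Inf, so keep only positive counts before transforming.
log_workers <- log10(workers[workers > 0])

# Bin the transformed values without drawing; drop plot = FALSE to draw.
h <- hist(log_workers, breaks = seq(1, 2.5, by = 0.25), plot = FALSE)
h$counts
```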
Distribution of industry
Distribution of revenue
## [1] 1953000 5528202691
Distribution of growth
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 42.45 84.21 151.70 516.40 347.70 159000.00
The distributions of growth and revenue look very similar. Let's try another type of plot to tease apart how they differ. The frequency polygon plot shows the different shapes of the distributions more clearly. Growth is derived from revenue, so it is not surprising that both distributions share the same right-skewed shape; similar shapes alone, however, do not show that the two variables are correlated.
Many of the highest-ranked companies are small businesses. This could be because smaller companies grow faster than big public companies, but it could also be because smaller companies start from a smaller revenue base. Absolute growth in dollars is different from percentage growth. For example, a company with no revenue in the previous year that gains some revenue the next year has infinite percentage growth. That says little about how much revenue the company generates compared with another company that earns more in absolute terms but has a lower percentage growth.
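To make the difference between percentage growth and dollar growth concrete, here is a small sketch with two hypothetical companies. All figures are invented for illustration and do not come from the dataset.

```r
# Two hypothetical companies (figures invented for illustration only).
prev <- c(small = 1e5, large = 1e8)    # revenue in the earlier year
curr <- c(small = 1e6, large = 1.5e8)  # revenue in the later year

growth_pct    <- (curr - prev) / prev * 100  # percentage growth
growth_dollar <- curr - prev                 # absolute growth in dollars

growth_pct     # small: 900, large: 50
growth_dollar  # small: 900000, large: 50000000

# A previous-year revenue of zero makes percentage growth infinite.
zero_base_pct <- (5e4 - 0) / 0 * 100
zero_base_pct  # Inf
```

The small company wins on percentage growth by a wide margin, while the large company adds far more revenue in dollars.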
I created two new variables. The first, revenue 2013, is last year's revenue, derived from current revenue and percentage growth. The second, growth in dollars, is revenue 2014 minus revenue 2013.
## [1] 123000 143853 153125 135000 373500 690697
## [1] 195517000 82496710 84923377 35158000 77278860 137286506
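The author's code for these two variables is not shown. The sketch below assumes that growth is expressed in percent, so that revenue 2014 = revenue 2013 × (1 + growth / 100). Under that assumption, the first three rows of the data reproduce the printed values to within rounding.

```r
# First three rows of revenue (2014) and growth (percent) from the data.
revenue2014       <- c(195640000, 82640563, 85076502)
growth_percentage <- c(158956.91, 57347.92, 55460.16)

# Assumed relationship: revenue2014 = revenue2013 * (1 + growth / 100)
revenue2013   <- revenue2014 / (1 + growth_percentage / 100)
growth_dollar <- revenue2014 - revenue2013

round(revenue2013)    # compare with the first printed vector above
round(growth_dollar)  # compare with the second printed vector above
```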
There is a limitation in my data set. Without data about resident populations in each state, city, or metro area, it is hard to determine whether the states with the most fast-growing companies have them simply because more people live there or because something about the state fosters growth. Therefore, I looked for population data from the U.S. Census Bureau and found population estimates for 2010 to 2014. These pair with the 2014 company data and the reverse-engineered revenue and growth numbers I calculated for 2013.
## Geographic_Area Census_April1 Estimate_Base Est_2010 Est_2011
## 1 United States 308,745,538 308,758,105 309,347,057 311,721,632
## 2 Northeast 55,317,240 55,318,348 55,381,690 55,635,670
## 3 Midwest 66,927,001 66,929,898 66,972,390 67,149,657
## 4 South 114,555,744 114,562,951 114,871,231 116,089,908
## 5 West 71,945,553 71,946,908 72,121,746 72,846,397
## 6 Alabama 4,779,736 4,780,127 4,785,822 4,801,695
## Est_2012 Est_2013 Est_2014
## 1 314,112,078 316,497,531 318,857,056
## 2 55,832,038 56,028,220 56,152,333
## 3 67,331,458 67,567,871 67,745,108
## 4 117,346,322 118,522,802 119,771,934
## 5 73,602,260 74,378,638 75,187,681
## 6 4,817,484 4,833,996 4,849,377
## [1] "state_l" "state_growth_dollar" "state_population2014"
## 'data.frame': 51 obs. of 3 variables:
## $ state_l : Factor w/ 51 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ state_growth_dollar : num 718058962 4966627 2379822839 71079926 18309472149 ...
## $ state_population2014: chr "4,849,377" "736,732" "6,731,484" "2,966,369" ...
I have two datasets. The original dataset is a list of the 5000 fastest growing private companies in 2014 in the U.S. from Inc. 5000. The second dataset I have is state population data from the Census Bureau. I have two resulting data frames: companies is the Inc. 5000 data set with new variables added, and state_growth is population data with additional variables.
The variables most interesting to explore are growth in percentage and in dollar amounts, since the Inc. 5000 dataset is specifically about the fastest growing private companies in the U.S. I am also very interested in the industry each company is in.

### What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Revenue will be an important way to understand growth. For example, a company with small revenue will see greater gains in percentage growth than a company with larger revenue, but the latter could have much greater revenue and growth in absolute dollar amounts. So it is critical to interpret growth in light of revenue.
State population data is also important to better understand growth. A larger state might appear to have greater growth in absolute dollar amounts but that could be influenced by a greater population. Therefore investigating growth per capita can provide a fairer way to look at growth, especially from the point of view of smaller states.
I created 4 new variables from existing variables across the two datasets. Two of them are in the companies data frame: 1. revenue2013, 2. growth_dollar. I reverse-engineered revenue for 2013 using revenue from 2014 and percentage growth. Then I subtracted the 2013 revenue from the 2014 revenue to get growth_dollar.
I also created a new data frame using the state population data from the census, and added two more variables to it: 3. state_growth_dollar and 4. growth_per_capita. state_growth_dollar was calculated by grouping companies by state and summing growth_dollar (the second variable above). growth_per_capita was created by dividing state_growth_dollar by the state population.
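The grouping and per-capita steps can be sketched as follows. The report attaches dplyr, but its exact code is not shown, so this uses base R `aggregate()` and `merge()` as a stand-in. It also assumes growth_per_capita is the state total divided by the state's 2014 population. The company rows are invented; the two population figures are the 2014 estimates for Alabama and Alaska from the output above.

```r
# Hypothetical company-level rows; the dollar figures are invented.
companies_demo <- data.frame(
  state_l       = c("Alabama", "Alabama", "Alaska"),
  growth_dollar = c(3e6, 2e6, 1e6)
)

# 2014 population estimates for two states, already converted to numeric.
population <- data.frame(
  state_l              = c("Alabama", "Alaska"),
  state_population2014 = c(4849377, 736732)
)

# Sum growth_dollar within each state (base R stand-in for a grouped sum).
state_growth <- aggregate(growth_dollar ~ state_l, data = companies_demo, FUN = sum)
names(state_growth)[2] <- "state_growth_dollar"

# Join on the state name, then divide the state total by population.
state_growth <- merge(state_growth, population, by = "state_l")
state_growth$growth_per_capita <-
  state_growth$state_growth_dollar / state_growth$state_population2014

state_growth
```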
The revenue, growth, and workers histograms were all skewed right with a very long tail, so I performed a log transformation to better understand the data. I also did a lot of tidying and adjusting to import and join the two data frames. In particular, I converted the population data to numeric, because the commas separating the thousands places were causing read.csv() to import the population numbers as characters. I needed numeric population values so I could perform the division that calculates growth_per_capita.
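One common way to do this conversion is to strip the commas with `gsub()` and then call `as.numeric()`. This is a sketch of that approach, not necessarily the exact code used in the report; the strings are the first four population values shown in the `str()` output.

```r
# Population strings as they appear in the str() output.
pop_chr <- c("4,849,377", "736,732", "6,731,484", "2,966,369")

# Strip the thousands separators, then convert to numeric.
pop_num <- as.numeric(gsub(",", "", pop_chr, fixed = TRUE))

pop_num
is.numeric(pop_num)
```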
geom_boxplot, geom_point, geom_violin, geom_jitter with geom_rug, geom_point(stat = 'summary'), geom_bin2d, geom_tile, geom_density2d, geom_point(alpha = 1/10, color = 'gray') + geom_line(stat = 'summary', fun.y = median), geom_point(alpha = 1/10, color = 'gray') + geom_step(stat = 'summary', fun.y = median)

# Bivariate Plots Section
Which state has greatest revenue growth per capita in 2014?
## Warning in loop_apply(n, do.ply): Removed 22 rows containing missing values
## (geom_point).
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Warning in loop_apply(n, do.ply): Removed 61 rows containing missing values
## (geom_point).
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## Warning in loop_apply(n, do.ply): Removed 469 rows containing missing
## values (geom_point).
## 'data.frame': 5000 obs. of 16 variables:
## $ row_num : int 0 1 2 3 4 5 6 7 8 9 ...
## $ id : int 22890 25747 25643 26098 26182 22913 22937 25413 26079 25861 ...
## $ rank : Ord.factor w/ 5000 levels "5000"<"4999"<..: 5000 4999 4998 4997 4996 4995 4994 4993 4992 4991 ...
## $ workers : int 227 191 145 62 92 50 129 130 264 11 ...
## $ company : Factor w/ 5000 levels "(add)ventures",..: 1725 3569 3651 4211 79 3520 1094 3357 4703 1826 ...
## $ url : Factor w/ 5000 levels "@properties",..: 1725 3569 3647 4211 76 3520 1094 3357 4703 1826 ...
## $ state_l : Factor w/ 51 levels "Alabama","Alaska",..: 5 5 48 5 22 20 5 3 38 34 ...
## $ state_s : Factor w/ 51 levels "AK","AL","AR",..: 5 5 47 5 20 22 5 4 38 28 ...
## $ city : Factor w/ 1352 levels "Acton","Ada",..: 355 355 37 930 737 46 1142 1103 979 1325 ...
## $ metro : Factor w/ 326 levels "","Adrian MI",..: 171 171 314 262 39 163 261 226 229 321 ...
## $ growth_percentage: num 158957 57348 55460 26043 20690 ...
## $ revenue2014 : num 195640000 82640563 85076502 35293000 77652360 ...
## $ industry : Factor w/ 25 levels "Advertising & Marketing",..: 5 11 2 23 24 7 13 13 25 7 ...
## $ yrs_on_list : int 2 1 1 1 1 2 2 1 1 1 ...
## $ revenue2013 : num 123000 143853 153125 135000 373500 ...
## $ growth_dollar : num 195517000 82496710 84923377 35158000 77278860 ...
## 'data.frame': 5000 obs. of 7 variables:
## $ workers : int 227 191 145 62 92 50 129 130 264 11 ...
## $ growth_percentage: num 158957 57348 55460 26043 20690 ...
## $ revenue2014 : num 195640000 82640563 85076502 35293000 77652360 ...
## $ industry : Factor w/ 25 levels "Advertising & Marketing",..: 5 11 2 23 24 7 13 13 25 7 ...
## $ yrs_on_list : int 2 1 1 1 1 2 2 1 1 1 ...
## $ revenue2013 : num 123000 143853 153125 135000 373500 ...
## $ growth_dollar : num 195517000 82496710 84923377 35158000 77278860 ...
## workers growth_percentage revenue2014 industry
## 1 227 158956.91 195640000 Consumer Products & Services
## 2 191 57347.92 82640563 Food & Beverage
## 3 145 55460.16 85076502 Business Products & Services
## 4 62 26042.96 35293000 Software
## 5 92 20690.46 77652360 Telecommunications
## 6 50 19876.52 137977203 Energy
## yrs_on_list revenue2013 growth_dollar
## 1 2 123000 195517000
## 2 1 143853 82496710
## 3 1 153125 84923377
## 4 1 135000 35158000
## 5 1 373500 77278860
## 6 2 690697 137286506
## Warning in loop_apply(n, do.ply): Stacking not well defined when ymin != 0
##
## Pearson's product-moment correlation
##
## data: companies$revenue2014 and companies$growth_percentage
## t = 0.1213, df = 4998, p-value = 0.9035
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.02600509 0.02943333
## sample estimates:
## cor
## 0.001715438
##
## Pearson's product-moment correlation
##
## data: companies$revenue2014 and companies$growth_dollar
## t = 208.4471, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9440788 0.9498019
## sample estimates:
## cor
## 0.9470155
##
## Pearson's product-moment correlation
##
## data: companies$revenue2013 and companies$growth_percentage
## t = -2.016, df = 4998, p-value = 0.04386
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.056179170 -0.000785593
## sample estimates:
## cor
## -0.02850427
##
## Pearson's product-moment correlation
##
## data: companies$revenue2013 and companies$growth_dollar
## t = 84.3823, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7548497 0.7777247
## sample estimates:
## cor
## 0.7665302
## Warning in loop_apply(n, do.ply): Removed 64 rows containing missing values
## (geom_point).
##
## Pearson's product-moment correlation
##
## data: companies$yrs_on_list and companies$revenue2014
## t = 10.7641, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1233182 0.1775016
## sample estimates:
## cor
## 0.150523
##
## Pearson's product-moment correlation
##
## data: companies$yrs_on_list and companies$revenue2013
## t = 12.0087, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1403959 0.1942809
## sample estimates:
## cor
## 0.1674635
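Each block above has the shape of `cor.test()` output for a Pearson correlation. The reported t statistic follows from the correlation r and the degrees of freedom as t = r · sqrt(df / (1 − r²)). As a check, the sketch below recomputes t for the revenue2014 versus growth_dollar test from its printed `cor` and `df` values.

```r
# Correlation and degrees of freedom reported for revenue2014 vs growth_dollar.
r  <- 0.9470155
df <- 4998

# t statistic for testing whether a Pearson correlation differs from zero.
t_stat <- r * sqrt(df / (1 - r^2))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

t_stat   # compare with the reported t = 208.4471
p_value  # compare with the reported p-value bound
```

The same formula explains why the revenue2014 versus growth_percentage test is not significant: with r near 0.0017, even 4998 degrees of freedom give a t statistic close to zero.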
## Warning in loop_apply(n, do.ply): Removed 37 rows containing non-finite
## values (stat_boxplot).
## Warning in loop_apply(n, do.ply): Removed 37 rows containing non-finite
## values (stat_ydensity).
## Warning in loop_apply(n, do.ply): Removed 124 rows containing missing
## values (geom_point).
## Warning in loop_apply(n, do.ply): Removed 31 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 64 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 21 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 69 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 469 rows containing missing
## values (geom_point).
## Warning in loop_apply(n, do.ply): Removed 65 rows containing missing values
## (geom_point).
## Warning in loop_apply(n, do.ply): Removed 362 rows containing missing
## values (geom_point).
# Bivariate Analysis